Abstract. The sliding window scheme is well known in information theory, computer science, prediction, and statistics. Let a source with unknown statistics generate a word $\ldots x_{-1} x_0 x_1 x_2 \ldots$ over some alphabet $A$. At every moment $t$, $t = \ldots, -1, 0, 1, \ldots$, one stores the word ("window") $x_{t-w} x_{t-w+1} \ldots x_{t-1}$, where $w$, $w > 1$, is called the "window length". In the theory of universal coding, the code of $x_t$ depends on the source statistics estimated from the window; in prediction, each letter $x_t$ is predicted using the information in the window; and so on. After that the letter $x_t$ is appended to the window on the right, while $x_{t-w}$ is removed from the window. This is the sliding window scheme. It has two merits: it allows one (i) to estimate the source statistics quite precisely and (ii) to adapt the code when the source statistics change. However, the scheme has a defect: the window (i.e. the word $x_{t-w} \ldots x_{t-1}$) has to be stored, which requires a large memory for large $w$. A new scheme named "the Imaginary Sliding Window (ISW)" is constructed. The gist of this scheme is that not the last element $x_{t-w}$ but rather a random one is removed from the window. This retains both merits of the sliding window while making it unnecessary to store the window, which significantly decreases the memory size.

Keywords. randomized data structure, storage and search of information, prediction, randomization, data compression.



1 Introduction 

There are many situations in which one deals with information sources with unknown or changing statistics. Among them we mention data compression [2], the problem of prediction [14] and the similar problem of prefetching [15], the problem of adaptive search [9], as well as statistical estimation of the parameters of information sources. There are many interesting ideas and algorithms for solving these problems. For example, when encoding information sources with unknown or changing statistics, different methods are used to adapt a code to the statistics of a source. Many such methods are based on the context-tree weighting procedure [10], on Lempel-Ziv codes [16] (see also the review in [2]), on a bookstack scheme [11]^, on a sliding window scheme, and on some others.

Some of these methods are used not only for information sources encoding, but also 
for solving problems connected with storage and search of the information, see [1], [9], 
[15] as well as with statistical estimation of parameters of changing random processes. 

The scheme of sliding window is quite popular and may be used jointly with the 
adaptive Huffman code [7], the arithmetic code [2], the interval code [4], the fast code 
[13], as well as for predicting [14] and prefetching [15] and other algorithms. 

Let us define the sliding window scheme (SW). A source generates the word $\ldots x_{-1} x_0 x_1 x_2 \ldots$ over a finite alphabet $A$. A computer stores the window $x_{t-w} x_{t-w+1} \ldots x_{t-1}$ at the moment $t$, where $w$ ($w > 1$) is the length of the window. The computer uses the window to estimate the source statistics. After that the computer moves the window as follows: the letter $x_t$ is appended to the window on the right, while the letter $x_{t-w}$ is removed from it. If the SW is used for data compression, there are two computers (an encoder and a decoder). The decoder performs the same operations on the window, and this allows the message to be decoded unambiguously. Naturally, the greater $w$, the more precise the estimate of the source statistics.
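The window movement just described can be sketched as follows. This is only an illustrative sketch: the class name `SlidingWindow` and its methods are hypothetical and not taken from the paper, and letters are represented as Python characters.

```python
from collections import Counter, deque


class SlidingWindow:
    """Keeps the last w letters and the count of each letter among them."""

    def __init__(self, initial_letters):
        self.window = deque(initial_letters)   # x_{t-w} ... x_{t-1}
        self.counts = Counter(self.window)     # occurrences of each letter

    def frequency(self, letter):
        """Relative frequency of `letter` in the current window."""
        return self.counts[letter] / len(self.window)

    def push(self, letter):
        """Append x_t on the right and drop x_{t-w} on the left."""
        oldest = self.window.popleft()
        self.counts[oldest] -= 1
        self.window.append(letter)
        self.counts[letter] += 1


sw = SlidingWindow("abab")   # w = 4
sw.push("b")                 # the window becomes "babb"
```

Note that the whole window has to be kept only in order to know which letter leaves it; this is the memory cost that the scheme proposed below avoids.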

Let us consider as an example the problem of encoding a Bernoulli source generating letters from some alphabet $A = \{a_1, a_2, \ldots, a_m\}$. In this case, the redundancy per letter of the best universal code is not less than $(m-1)/(2w) + O(1/w)$, where $w$ is the length of the window (see [8]). (In the case of prediction and prefetching the redundancy is equal to the precision of the prediction.) Hence, to achieve a small redundancy (or a high precision), the length of the window has to be much greater than the number of letters in the alphabet $A$. Obviously, to keep the window, the encoder and the decoder need $w \log m$ bits of memory.

By $\nu^t(a)$ we denote the frequency of occurrence of the letter $a$ in the window $x_{t-w} \ldots x_{t-1}$, for every $a \in A$. It is known that the vector $\{\nu^t(a), a \in A\}$ is a sufficient statistic for the Bernoulli source (see [6], for example). Informally, this means that this vector carries all the information about the source that is contained in $x_{t-w} \ldots x_{t-1}$. However, only $m \log w$ bits are needed to keep the frequencies (which is exponentially less than $w \log m$). In the sliding window scheme, after the encoding of the current letter $x_t$ the corresponding frequency increases by 1 ($\nu^{t+1}(x_t) = \nu^t(x_t) + 1$), while the frequency of the letter $x_{t-w}$, which is removed from the window, decreases by 1.

Informally, in the ISW scheme we propose to keep only the vector of frequencies $(\bar\nu^t(a_1), \bar\nu^t(a_2), \ldots, \bar\nu^t(a_m))$, $\sum_{i=1}^{m} \bar\nu^t(a_i) = w$. After the coding of the current letter $x_t$ we increase its count by 1 (as before, $\bar\nu^{t+1}(x_t) = \bar\nu^t(x_t) + 1$) and then decrease by 1 the count of a randomly selected letter, where the probability of decreasing the count of the letter $a \in A$ is equal to $\bar\nu^t(a)/w$.

Thus, in the Imaginary Sliding Window scheme only the vector of integers $(\bar\nu^t(a_1), \ldots, \bar\nu^t(a_m))$ is kept. After the encoding of $x_t$ the number $\bar\nu^t(x_t)$ increases by 1 and one randomly selected number decreases by 1.
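One step of this update can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `isw_update` is hypothetical, letters are represented by their indices, and Python's standard pseudo-random generator stands in for the source of random digits discussed later in the paper.

```python
import random


def isw_update(counts, letter, rng):
    """One ISW step on `counts` (a list of m non-negative ints summing to w).

    A coordinate is chosen with probability counts[i] / w and decreased by 1,
    then the coordinate of the newly coded letter is increased by 1.
    """
    removed = rng.choices(range(len(counts)), weights=counts)[0]
    counts[removed] -= 1
    counts[letter] += 1
    return removed


rng = random.Random(0)            # seed chosen arbitrarily for this sketch
counts = [4, 4, 4, 4]             # m = 4 letters, w = 16
for x in [0, 0, 1, 0, 3, 0, 0]:   # indices of the coded letters
    isw_update(counts, x, rng)
```

A coordinate equal to zero has weight zero and is never selected, so the counts stay non-negative and their sum stays equal to the window length.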

^ The bookstack scheme was proposed in the author's paper [11] and then rediscovered in 1986-1987 in [3] and [4] (see also [12]). Now this scheme is usually called the "move-to-front scheme", as proposed in [3].



It turns out that this scheme possesses properties similar to those of the usual sliding window: first, the distribution of the vector $(\bar\nu^t(a_1), \ldots, \bar\nu^t(a_m))$ is similar to the one for the sliding window and, second, when the source statistics change, the vector $(\bar\nu^t(a_1), \ldots, \bar\nu^t(a_m))$ adapts to the change.

The construction of the ISW scheme is based on the use of random numbers or, more precisely, on a sequence of independent equiprobable random binary digits.

There are two usual ways to obtain random digits: the first is to use a table of random digits, and the second is to use a generator of random digits. Both may be used for prediction and statistical estimation. However, in the case of data compression the sequence of digits has to be the same in the encoder and the decoder. It is possible to use a third way to obtain random digits. In this case the random digits are not ideal, but they are obtained "free of charge". The point is that the encoded, "compressed" sequence is "nearly" random. Moreover, the smaller the code redundancy, the closer the encoded message is to a sequence of random bits generated by tossing a symmetric coin. We recommend using as random digits that part of the message generated by the source which has already been encoded. It is important that this sequence is known to both the encoder and the decoder and that, while coding, it is sufficient to keep only a small current piece of it.

We now outline the remainder of the paper. In Section 2 the asymptotic equivalence of the imaginary sliding window scheme and the usual sliding window scheme is proved, and an extension to the case of Markovian sources is given. In Section 3 a fast method of using the random digits needed by the Imaginary Sliding Window scheme is proposed. This method allows encoding and decoding to proceed without delay.

2 The Scheme of the Imaginary Sliding Window 

Let us give some necessary definitions. Denote the set of words of length $k$ over the alphabet $A$ by $A^k$ ($k \ge 0$). Let $\Omega_\infty$ be the set of all stationary ergodic sources generating letters from $A$. For $\omega \in \Omega_\infty$, $a \in A$, $k > 0$, $u \in A^k$, denote by $P_\omega(a/u)$ the probability that the letter $a$ is generated next by the source $\omega$ given that the word $u \in A^k$ has been generated by the same source. By definition, $\mu$ is the memory of the source if for all letters $a, u_1, u_2, \ldots, u_k \in A$, $k \ge \mu$, the equality

$$P_\omega(a/u_1 \ldots u_k) = P_\omega(a/u_1 \ldots u_\mu)$$

is valid (when $\mu = 0$ the source is said to be a Bernoulli source). Denote by $\Omega_\mu$, $\mu \ge 0$, the set of all $\omega \in \Omega_\infty$ with memory $\mu$.

Let us describe the ISW scheme for the case of a Bernoulli source. Let $x_1 x_2 \ldots x_t \ldots$ be the sequence generated by some Bernoulli source $\omega \in \Omega_0$. Let $w$, $w > 1$, be the window length and, for any integer $t$, let the word $x_{t-w} x_{t-w+1} \ldots x_{t-1}$ be the window at the moment $t$.

Denote by $\nu_i^t$ the number of occurrences of the letter $a_i \in A$ in the window $x_{t-w} \ldots x_{t-1}$. It is easy to see that $(\nu_1^t, \ldots, \nu_m^t)$ is a random vector governed by the multinomial distribution

$$P\{\nu_1^t = n_1, \nu_2^t = n_2, \ldots, \nu_m^t = n_m\} = \binom{w}{n_1, \ldots, n_m} \prod_{i=1}^{m} P(a_i)^{n_i}. \qquad (1)$$
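In the construction of the ISW only the vector of integers $\bar\nu^t = (\bar\nu_1^t, \bar\nu_2^t, \ldots, \bar\nu_m^t)$ is stored (without a window); it changes after the encoding of every letter $x_t$. To describe the rules for changing the vector $\bar\nu^t$, let us introduce the random variable $\varepsilon^t$ that takes the values $1, 2, \ldots, m$ with probabilities $\bar\nu_1^t/w, \bar\nu_2^t/w, \ldots, \bar\nu_m^t/w$, respectively, i.e.

$$P\{\varepsilon^t = i\} = \bar\nu_i^t / w, \quad i = 1, 2, \ldots, m. \qquad (2)$$

After the encoding of every letter $x_t$ of the message $x_1 x_2 \ldots x_t$, first $\varepsilon^t$ is generated and then the transition from the vector $\bar\nu^t$ to the vector $\bar\nu^{t+1}$ is performed:

$$\bar\nu_j^{t+1} = \begin{cases} \bar\nu_j^t - 1, & \text{if } j = \varepsilon^t \text{ and } a_j \ne x_t, \\ \bar\nu_j^t + 1, & \text{if } a_j = x_t \text{ and } j \ne \varepsilon^t, \\ \bar\nu_j^t, & \text{if } j \ne \varepsilon^t \text{ and } a_j \ne x_t, \text{ or } j = \varepsilon^t \text{ and } x_t = a_j. \end{cases} \qquad (3)$$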



In other words, first, one randomly chosen coordinate of the vector $\bar\nu^t$ is decreased by 1. (This operation is analogous to decreasing by 1 the counter corresponding to $x_{t-w}$ when the window moves from $x_{t-w} \ldots x_{t-1}$ to $x_{t-w+1} \ldots x_t$; instead of removing $x_{t-w}$, one randomly chosen letter is "thrown out" of the window.) Second, the coordinate of the vector $\bar\nu^t$ corresponding to the letter $x_t$ is increased by 1.

The initial vector $\bar\nu^0 = (\bar\nu_1^0, \ldots, \bar\nu_m^0)$ may be chosen arbitrarily. For example, $\bar\nu_i^0 = w/m$, $i = 1, \ldots, m$, may be taken if $w/m$ is an integer.

Now we investigate the properties of the ISW. First, we demonstrate that the distribution of the vector $\bar\nu^t$ asymptotically coincides with (1), i.e. it is the same as the distribution of the frequencies of occurrence of letters in the sliding window scheme.

Theorem 1. Let a Bernoulli source be given which generates letters from the alphabet $A = \{a_1, \ldots, a_m\}$ with probabilities $P(a_1), \ldots, P(a_m)$, and let $n_1, n_2, \ldots, n_m$ be any nonnegative integers such that $\sum_{i=1}^{m} n_i = w$, $w > 1$. Then for the ISW scheme with the vector of frequencies $\bar\nu^t = (\bar\nu_1^t, \ldots, \bar\nu_m^t)$ the following equality is valid for any initial vector $\bar\nu^0$:

$$\lim_{t \to \infty} P\{\bar\nu_1^t = n_1, \ldots, \bar\nu_m^t = n_m\} = \binom{w}{n_1, \ldots, n_m} \prod_{i=1}^{m} P(a_i)^{n_i}.$$

Proofs of all theorems are given in the appendix.
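The invariance behind Theorem 1 can be checked exactly on a small case. The sketch below is not the paper's proof: it enumerates all count vectors for assumed toy values $w = 4$, $m = 3$, pushes the multinomial distribution through one ISW transition using exact rational arithmetic, and compares the result with the starting distribution. The function names and the chosen probabilities are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def multinomial_pmf(n, probs):
    """Probability of the count vector n under the multinomial distribution."""
    value = Fraction(factorial(sum(n)))
    for n_i, p_i in zip(n, probs):
        value = value / factorial(n_i) * p_i ** n_i
    return value


def isw_step(dist, probs, w):
    """Push a distribution over count vectors through one ISW transition."""
    m = len(probs)
    new = {}
    for n, mass in dist.items():
        for removed in range(m):        # coordinate decreased, prob n[removed]/w
            if n[removed] == 0:
                continue
            for added in range(m):      # coded letter, prob probs[added]
                nxt = list(n)
                nxt[removed] -= 1
                nxt[added] += 1
                weight = mass * Fraction(n[removed], w) * probs[added]
                key = tuple(nxt)
                new[key] = new.get(key, 0) + weight
    return new


w = 4
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
states = [n for n in product(range(w + 1), repeat=len(probs)) if sum(n) == w]
limit = {n: multinomial_pmf(n, probs) for n in states}
after_one_step = isw_step(limit, probs, w)
```

If the multinomial distribution is stationary for the ISW transition, `after_one_step` equals `limit` exactly, which is what the limit statement of the theorem requires.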

Hence, the values $(\bar\nu_1^t, \bar\nu_2^t, \ldots, \bar\nu_m^t)$ may replace the frequencies of occurrence of letters in a usual sliding window, and so the ISW may asymptotically replace the SW.

The question arises of the rate of convergence of the distribution of $(\bar\nu_1^t, \ldots, \bar\nu_m^t)$ to the multinomial distribution (1). This is important because this rate affects the rate of adaptation of the ISW to changes in the statistics. (In fact, we may assume that the statistics change at the moment $t = 0$.) We mention two results characterizing the rate at which the distribution of the frequencies $(\bar\nu_1^t, \ldots, \bar\nu_m^t)$ approaches the limit distribution, assuming that the vector $(\bar\nu_1^0, \ldots, \bar\nu_m^0)$ is chosen arbitrarily. For simplicity, let us define the random vector $\bar\nu^\infty = (\bar\nu_1^\infty, \ldots, \bar\nu_m^\infty)$ by

$$P\{\bar\nu_1^\infty = n_1, \ldots, \bar\nu_m^\infty = n_m\} = \binom{w}{n_1, \ldots, n_m} \prod_{i=1}^{m} P(a_i)^{n_i}. \qquad (4)$$

(Such a definition is based on Theorem 1.)

In information theory and statistics the Kullback-Leibler divergence is a well-known measure of the discrepancy between two probability distributions. The next theorem gives an estimate of the divergence between the distribution of $(\bar\nu_1^t, \ldots, \bar\nu_m^t)$ and (4).

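Theorem 2. Suppose a Bernoulli source generating letters from the finite alphabet $A = \{a_1, \ldots, a_m\}$ is given and the ISW scheme with "window length" $w$ is used. Let $R^t$ be the Kullback-Leibler divergence between the probability distributions of the vectors of frequencies $(\bar\nu_1^t, \ldots, \bar\nu_m^t)$ and $(\bar\nu_1^\infty, \ldots, \bar\nu_m^\infty)$, defined by the equation

$$R^t = \sum_{n \in S} P\{\bar\nu^\infty = n\} \log \frac{P\{\bar\nu^\infty = n\}}{P\{\bar\nu^t = n\}}, \qquad (5)$$

where $S$ is the set of all vectors $n = (n_1, \ldots, n_m)$ of nonnegative integers with $\sum_{i=1}^{m} n_i = w$. Then, for any initial distribution of frequencies $(\bar\nu_1^0, \ldots, \bar\nu_m^0)$, the inequality

$$R^t \le -\log \left( \sum_{k=0}^{w} (-1)^k \binom{w}{k} \left(1 - \frac{k}{w}\right)^t \right) \qquad (6)$$

is valid.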

The right-hand side of (6) is rather cumbersome. For large $t$ and $w$ the following asymptotic estimate is valid.

Corollary. Let $t \to \infty$ and let

$$\lambda = w e^{-t/w}. \qquad (7)$$

Then $R^t \le \lambda + o(\lambda)$.

It readily follows from the corollary that $R^t$ becomes small when $t > w \log w$. If, for example, $t = w \log w + bw$, then $\lambda = e^{-b}$ and the bound on $R^t$ is close to $e^{-b}$.

Thus, the ISW "remembers" the initial distribution of probabilities for a period approximately equal to $w \log w$. A "usual" SW "remembers" the initial distribution of probabilities until its contents are completely renewed, i.e. until $t = w$.
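The decay of the divergence can be illustrated numerically for a binary alphabet. The sketch below rests on several assumptions that are not in the paper: $m = 2$, $w = 8$, $P(a_1) = 0.3$, an initial window consisting only of $a_1$, and the reading of the bound as minus the logarithm of the probability that every one of $w$ boxes has been refreshed within $t$ steps (the classical occupancy formula used in the appendix). All function names are illustrative.

```python
import math


def isw_step_binary(dist, w, p):
    """One ISW step for m = 2; dist[k] = P{count of letter a_1 equals k}."""
    new = [0.0] * (w + 1)
    for k, mass in enumerate(dist):
        up = (w - k) / w * p        # a_2 thrown out, a_1 coded
        down = k / w * (1 - p)      # a_1 thrown out, a_2 coded
        new[k] += mass * (1 - up - down)
        if k < w:
            new[k + 1] += mass * up
        if k > 0:
            new[k - 1] += mass * down
    return new


def kl_from_limit(limit, dist):
    """Sum over n of limit[n] * log(limit[n] / dist[n])."""
    total = 0.0
    for q, d in zip(limit, dist):
        if q > 0:
            if d == 0:
                return math.inf
            total += q * math.log(q / d)
    return total


def occupancy_bound(w, t):
    """-log P{all w boxes were refreshed at least once within t steps}."""
    covered = sum((-1) ** k * math.comb(w, k) * (1 - k / w) ** t
                  for k in range(w + 1))
    return -math.log(covered) if covered > 0 else math.inf


w, p = 8, 0.3
limit = [math.comb(w, k) * p ** k * (1 - p) ** (w - k) for k in range(w + 1)]
dist = [0.0] * w + [1.0]            # start: all w counts on letter a_1
rows = []                           # (t, divergence, bound, w * exp(-t / w))
for t in range(1, 65):
    dist = isw_step_binary(dist, w, p)
    if t in (16, 32, 64):
        rows.append((t, kl_from_limit(limit, dist), occupancy_bound(w, t),
                     w * math.exp(-t / w)))
```

Each row of `rows` allows the exact divergence to be compared with the bound and with the asymptotic quantity $w e^{-t/w}$ for the chosen $t$.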

To encode a source, as well as in many other applications of the SW (and ISW) schemes, estimates of the probabilities $P(a_1), \ldots, P(a_m)$ are needed, and the values $\bar\nu_i^t/w$, $i = 1, \ldots, m$ (or similar ones) are used as these estimates. The next theorem gives an estimate of the proximity of $\bar\nu_i^t/w$ to $P(a_i)$.

Theorem 3. Under the hypotheses of Theorem 2,

$$\left| E(\bar\nu_i^t / w) - P(a_i) \right| \le e^{-t/w} \quad \text{for } i = 1, \ldots, m.$$



It readily follows that the mean values of the estimates of the probabilities of the letters $a \in A$ obtained by using the ISW approach the true values $P(a)$ quite rapidly as $t$ increases. This is important for applications of the ISW because $t$ may be interpreted as the time elapsed since a change of the statistics (at the moment $t = 0$).
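The behaviour of the mean estimate can be traced with a short computation. The sketch below uses a recursion that follows from the update rule (a count loses 1 with probability equal to its current value divided by $w$ and gains 1 with probability $P(a_i)$); the function name and the numerical values $w = 64$, $P(a_i) = 0.2$ are assumptions made only for illustration.

```python
import math


def expected_estimate(w, p_i, start_fraction, t):
    """Mean of (count of a_i) / w after t ISW steps.

    Uses E(count at t+1 | count at t) = count - count / w + p_i.
    """
    mean_count = start_fraction * w
    for _ in range(t):
        mean_count = mean_count * (1 - 1 / w) + p_i
    return mean_count / w


w, p_i = 64, 0.2
moments = (0, 64, 256)
# Worst-looking start: the whole imaginary window is filled with a_i.
gaps = [abs(expected_estimate(w, p_i, 1.0, t) - p_i) for t in moments]
bounds = [math.exp(-t / w) for t in moments]
```

Comparing `gaps` with `bounds` entry by entry shows how the deviation of the mean estimate relates to $e^{-t/w}$ at the chosen moments.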

Now we apply the ISW scheme to the case of Markovian sources. Let $\mu \ge 1$ and let it be known that $\omega \in \Omega_\mu$. The construction of the ISW described above may be applied to this case as follows: while encoding and decoding, we store $|A|^\mu$ imaginary windows, each corresponding to one word from $A^\mu$. Furthermore, one "real" window consisting of $\mu$ letters is kept in the memory of the encoder and the decoder, and the last $\mu$ encoded letters are stored in this window. For example, let a source generate the message $x_1 x_2 \ldots x_t \ldots$. Then, before the encoding of $x_t$, the letters $x_{t-\mu} \ldots x_{t-1}$ are stored in the "real" window. This word belongs to $A^\mu$; hence the ISW corresponding to it exists, and the letter $x_t$ is encoded in accordance with the information stored in this window. After the encoding of $x_t$, the same transformations are applied to the ISW corresponding to $x_{t-\mu} \ldots x_{t-1}$ as in the Bernoulli case described above. (One randomly chosen frequency decreases by 1, and the frequency corresponding to $x_t$ increases by 1.)

Let us consider an example which explains the described construction. Let $A = \{0, 1\}$, $\mu = 2$, and let 001011 be the sequence being encoded. The encoder and the decoder keep in their memory $2^2 = 4$ imaginary windows, each consisting of two nonnegative integers whose sum equals the "window length" $w$ ($w$ is any positive integer). The first number corresponds to the frequency of occurrence of the letter 0 in the window, and the second number corresponds to the frequency of occurrence of the letter 1. The letter $x_3$, which follows 00, is coded on the basis of the window corresponding to the word 00, the letter $x_4$ is coded on the basis of the contents of the window 01, etc.

Thus, while encoding a source of memory $\mu$ we use a method which is well known in information theory: the Markovian source is represented as a collection of Bernoulli sources. Due to this, every letter generated by the source is encoded and decoded according to the information stored in the window corresponding to the relevant Bernoulli source.
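The example with $A = \{0, 1\}$, $\mu = 2$ and the message 001011 can be sketched as follows. The class name `MarkovISW`, the uniform initial counts, the value $w = 8$ and the seeded pseudo-random generator are assumptions made for this sketch; the actual coding of each letter is omitted.

```python
import random
from itertools import product


class MarkovISW:
    """One imaginary window (a count vector) per context of mu letters."""

    def __init__(self, alphabet, mu, w, seed=0):
        m = len(alphabet)
        assert w % m == 0, "sketch assumes w is divisible by the alphabet size"
        self.index = {a: i for i, a in enumerate(alphabet)}
        self.mu = mu
        self.windows = {ctx: [w // m] * m
                        for ctx in product(alphabet, repeat=mu)}
        self.rng = random.Random(seed)   # encoder and decoder must agree on it

    def update(self, context, letter):
        """Apply the Bernoulli-case ISW step to the window of this context."""
        counts = self.windows[tuple(context)]
        removed = self.rng.choices(range(len(counts)), weights=counts)[0]
        counts[removed] -= 1
        counts[self.index[letter]] += 1


message = "001011"
model = MarkovISW(alphabet="01", mu=2, w=8)
used_contexts = []
for t in range(model.mu, len(message)):
    context = message[t - model.mu:t]    # the "real" window of mu letters
    # Here the letter message[t] would be coded from model.windows[tuple(context)].
    model.update(context, message[t])
    used_contexts.append(context)
```

In this run the contexts visited are 00, 01, 10 and 01, in agreement with the example: the third letter is handled by the window of 00 and the fourth by the window of 01.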

3 Fast Algorithm for Transformation of the ISW 

After the coding of every letter of a message, the frequencies of the ISW are transformed: one frequency increases by 1 and another, randomly chosen, decreases by 1. In this section a simple and fast algorithm for implementing the random choice is considered.

Let some generator of random bits produce the sequence $z = z_1 z_2 \ldots z_k$ consisting of symbols from the alphabet $\{0, 1\}$. We do not estimate the complexity of generating these symbols and consider only the method of transforming the random bits into values of the random variable $\varepsilon^t$ (see (2)), which is used for the random choice of the frequency to be decreased by 1.

Let us give some definitions before describing the algorithm. For simplicity, we suppose that the window length $w$ and the number of letters of the alphabet $m$ may be represented as

$$w = 2^u, \quad m = 2^\mu, \qquad (8)$$

where $u$ and $\mu$ are integers. Let $\bar\nu^t = (\bar\nu_1^t, \ldots, \bar\nu_m^t)$ be the integer-valued vector characterizing the imaginary window. To generate a value of the random variable $\varepsilon^t$, first $u$ random bits $z_1 \ldots z_u$ are produced, and we let

$$z = \sum_{j=1}^{u} z_j 2^{j-1}.$$

Together with (8), this shows that $z$ is equal with the same probability to any value from the set $\{0, 1, \ldots, w-1\}$, i.e.

$$P\{z = i\} = 1/w, \quad i = 0, 1, \ldots, w-1. \qquad (9)$$

Let us define

$$Q_1 = 0, \quad Q_j = \sum_{k=1}^{j-1} \bar\nu_k^t, \quad j = 2, \ldots, m+1. \qquad (10)$$

Let the random variable $\varepsilon^t$ take the value $j$, $1 \le j \le m$, if the two inequalities

$$Q_j \le z < Q_{j+1} \qquad (11)$$

hold. From this definition it follows that

$$P\{\varepsilon^t = j\} = P\{Q_j \le z < Q_{j+1}\} = (Q_{j+1} - Q_j)/w = \bar\nu_j^t / w.$$

(Here the second and the third equalities follow from (9) and (10).) Hence, we obtain

$$P\{\varepsilon^t = j\} = \bar\nu_j^t / w,$$

which is the same as definition (2). It follows that the given method of generating the random variable $\varepsilon^t$ is correct; however, it is rather complex. The point is that after the encoding of the current letter $x_t$ of the message two frequencies must be changed (one has to be increased by 1 and one has to be decreased by 1). After that, in turn, the values $\{Q_j\}$ must be recalculated. In the case of a large source alphabet, the calculation of the values $Q_j$ (according to (10)) and the search for $j$ (according to (11)) may take too much time. More exactly, $O(m \log w)$ operations on one-bit words are needed as $m \to \infty$.
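The direct method based on the cumulative sums can be sketched as follows. The function name `select_linear` and the sample counts are illustrative assumptions; the value `z` is taken to be an integer already assembled from the random bits.

```python
from itertools import accumulate


def select_linear(counts, z):
    """Return j (1-based) with Q_j <= z < Q_{j+1}, for 0 <= z < sum(counts)."""
    boundaries = list(accumulate(counts))      # Q_2, Q_3, ..., Q_{m+1}
    for j, upper in enumerate(boundaries, start=1):
        if z < upper:
            return j
    raise ValueError("z must be smaller than the window length w")


counts = [3, 0, 4, 1]                          # m = 4, w = 8
chosen = [select_linear(counts, z) for z in range(8)]
# chosen == [1, 1, 1, 3, 3, 3, 3, 4]
```

Each index is hit by exactly as many values of `z` as its count, so a uniformly distributed `z` selects index `j` with probability equal to its count divided by $w$. The cost is a full pass over the $m$ counts, which is the inefficiency noted above.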

In the remainder of this section we describe an algorithm which carries out all operations with the ISW in time $O(\log m \log w)$ for large $m$ and $w$. This algorithm is close to the fast letter-by-letter code from the author's paper [13].

To describe the method, let us define

$$S^t_{1,j} = \bar\nu_j^t, \; j = 1, 2, \ldots, m; \qquad S^t_{k,j} = S^t_{k-1,2j-1} + S^t_{k-1,2j}, \; k = 2, \ldots, \mu; \; j = 1, \ldots, m/2^{k-1}. \qquad (12)$$

These values are stored in the memory of the encoder and the decoder. To generate a value of the random variable $\varepsilon^t$ we use, instead of (11), the following algorithm. First, we check the inequality

$$z < S^t_{\mu,1}. \qquad (13)$$

If it is valid, then $1 \le \varepsilon^t \le m/2$; otherwise $m/2 + 1 \le \varepsilon^t \le m$. Then, if inequality (13) holds, we check the inequality

$$z < S^t_{\mu-1,1}. \qquad (14)$$

Otherwise we calculate $z := z - S^t_{\mu,1}$ and check the condition

$$z < S^t_{\mu-1,3}. \qquad (15)$$

Checking (14) decides whether $1 \le \varepsilon^t \le m/4$ or $m/4 + 1 \le \varepsilon^t \le m/2$. Checking (15) decides whether $m/2 + 1 \le \varepsilon^t \le 3m/4$ or $3m/4 + 1 \le \varepsilon^t \le m$. Continuing in this way, after $\log m = \mu$ steps we find $\varepsilon^t$. At every step it is necessary to make one comparison and, possibly, one subtraction of numbers each of which is a word of length $u = \log w$ bits. Thus, the total number of operations on single-bit words is $O(\log m \log w)$.

Now let us describe the "fast" method of transition from $\{S^t_{k,j}\}$ to $\{S^{t+1}_{k,j}\}$. Suppose that in the transition from $t$ to $t + 1$ the $j$-th coordinate of the vector $\bar\nu^t$ increases by 1 and the $k$-th coordinate decreases by 1 ($j, k \in \{1, 2, \ldots, m\}$, see (3)). Then we have to increase by 1 one value and decrease by 1 one value from each of the sets

$$\{S^t_{1,i},\ i = 1, \ldots, m\},\ \{S^t_{2,i},\ i = 1, \ldots, m/2\},\ \{S^t_{3,i},\ i = 1, \ldots, m/4\},\ \ldots,\ \{S^t_{\mu,i},\ i = 1, 2\},$$

i.e. make $\mu = \log m$ additions of 1 and $\mu = \log m$ subtractions of 1. Each addition and subtraction is made on numbers of length $u = \log w$, so the total number of operations on single-bit words is $O(\log m \log w)$. Thus, with the proposed fast method, the number of operations needed to transform the ISW after the encoding of a letter is $O(\log m \log w)$.
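The selection and update procedures of this section can be sketched together as a small tree of pairwise sums. This is an illustrative sketch, not the author's implementation: the class name `SumTree`, the 0-based internal indexing and the sample counts are assumptions, and Python integers stand in for the $u$-bit words.

```python
class SumTree:
    """Levels of pairwise sums over m = 2**mu counts (level 0 holds the counts)."""

    def __init__(self, counts):
        m = len(counts)
        assert m >= 2 and m & (m - 1) == 0, "sketch assumes m is a power of two"
        self.levels = [list(counts)]
        while len(self.levels[-1]) > 2:
            prev = self.levels[-1]
            self.levels.append([prev[i] + prev[i + 1]
                                for i in range(0, len(prev), 2)])

    def select(self, z):
        """Return the 1-based index chosen by z, using one comparison per level."""
        pos = 0                          # left candidate on the current level
        for level in reversed(self.levels):
            if z >= level[pos]:          # not in the left block: move right
                z -= level[pos]
                pos += 1
            pos *= 2                     # left child on the level below
        return pos // 2 + 1

    def add(self, j, delta):
        """Add delta to count j (1-based) and to every sum that contains it."""
        pos = j - 1
        for level in self.levels:
            level[pos] += delta
            pos //= 2


tree = SumTree([3, 0, 4, 1])             # m = 4, w = 8
before = [tree.select(z) for z in range(8)]
tree.add(2, +1)                          # the coded letter is a_2
tree.add(3, -1)                          # the randomly chosen letter is a_3
after = [tree.select(z) for z in range(8)]
```

Both `select` and `add` touch one entry per level, that is $\log m$ entries, in contrast with the pass over all $m$ counts needed by the direct method.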



Appendix 

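Proof of Theorem 1.

Denote by $S$ the set of vectors of the form $\delta = (\delta_1, \ldots, \delta_m)$ such that all $\delta_j$ are nonnegative integers and $\sum_{j=1}^{m} \delta_j = w$. Let us consider a Markov chain $M$ whose states coincide with the elements of $S$ and whose matrix of transition probabilities is defined by the equality

$$P_{\sigma,\delta} = \begin{cases} P(a_i)\,\sigma_j / w, & \text{if } \delta_i = \sigma_i + 1,\ \delta_j = \sigma_j - 1 \text{ and } \delta_k = \sigma_k \text{ for } k \ne i, j, \\ \sum_{k=1}^{m} P(a_k)\,\sigma_k / w, & \text{if } \delta_1 = \sigma_1, \delta_2 = \sigma_2, \ldots, \delta_m = \sigma_m, \\ 0, & \text{for all other } \sigma, \delta. \end{cases} \qquad (16)$$

This Markov chain simulates the behaviour of the ISW.

Using the standard technique of Markov chains (described, for example, in [5]), it is easy to check that the limit probabilities for $M$ exist and are given by the equality

$$\lim_{t \to \infty} P\{\bar\nu^t = \delta\} = \binom{w}{\delta_1, \ldots, \delta_m} \prod_{i=1}^{m} P(a_i)^{\delta_i},$$

which proves Theorem 1.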
Proof of Theorem 2.

Let us introduce a new scheme, the sliding window with random removal of elements (SWRRE). In this scheme a sequence of $w$ "boxes" ($w > 1$) is considered. Every "box" may contain a letter from the alphabet $A$. As above, a Bernoulli source is given which generates the sequence $x_1 x_2 \ldots$, $x_j \in A$ for all $j$, and $P(a)$ is the probability of generating the letter $a \in A$.

At the initial moment there are letters from $A$ in the "boxes". At every moment $t = 1, 2, \ldots$ two operations are made. First, a random value $\eta^t$ is produced which is equal to each of $1, 2, \ldots, w$ with probability $1/w$, and the letter in box number $\eta^t$ is removed. Then a random value $\psi^t$ is produced which takes the values $1, 2, \ldots, m$ so that

$$P\{\psi^t = k\} = P(a_k),$$

and the letter $a_{\psi^t}$ is placed in the box which became free.

It is easy to see that the SWRRE scheme is an exact but more detailed model of the ISW scheme. Indeed, denote by $\nu_k^t$ the random value equal to the number of boxes containing the letter $a_k$ at the moment $t$, and let $\nu^t = (\nu_1^t, \ldots, \nu_m^t)$. It follows from the description of the SWRRE scheme that the probabilities of transition from $\nu^t$ to $\nu^{t+1}$ are also defined by the equality (16). Hence, if the initial distribution is the same for both schemes ISW and SWRRE (i.e. $\bar\nu^0 = \nu^0$), then the probability distributions at all other moments are the same: for any $n = (n_1, \ldots, n_m)$

$$P\{\bar\nu^t = n\} = P\{\nu^t = n\}. \qquad (17)$$

Now let us introduce a new random value connected with the SWRRE. By definition, $\varphi^t$ is equal to the number of boxes from which letters were not removed at the moments $1, 2, \ldots, t$. Note that the distribution of this value is well known (see, for example, [5], where $\varphi^t$ is the number of empty boxes obtained after the random distribution of $t$ elements among $w$ boxes). It is known that

$$P\{\varphi^t = 0\} = \sum_{k=0}^{w} (-1)^k \binom{w}{k} \left(1 - \frac{k}{w}\right)^t \qquad (18)$$

(see [5]).

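Suppose that in the SWRRE scheme the letters in all boxes have been replaced by time $t$ (i.e. $\varphi^t = 0$). Then, obviously, the distribution of the vector $\nu^t$ does not depend on $t$ and is multinomial:

$$P\{\nu_1^t = n_1, \ldots, \nu_m^t = n_m \,/\, \varphi^t = 0\} = \binom{w}{n_1, \ldots, n_m} \prod_{i=1}^{m} P(a_i)^{n_i}.$$

Together with (4), this yields

$$P\{\bar\nu^\infty = n\} = P\{\nu^t = n \,/\, \varphi^t = 0\}.$$

It follows that

$$P\{\nu^t = n\} \ge P\{\bar\nu^\infty = n\}\, P\{\varphi^t = 0\}.$$

From this and (17) we have

$$P\{\bar\nu^t = n\} \ge P\{\bar\nu^\infty = n\}\, P\{\varphi^t = 0\}.$$

Together with the definition (5) of $R^t$, this yields

$$R^t \le \sum_{n \in S} P\{\bar\nu^\infty = n\} \log \frac{P\{\bar\nu^\infty = n\}}{P\{\bar\nu^\infty = n\}} - \sum_{n \in S} P\{\bar\nu^\infty = n\} \log P\{\varphi^t = 0\}.$$

From this and (18) we obtain (6).
The theorem is proved.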

The proof of the Corollary readily follows from the known estimates of the number of 
empty boxes (see [5]). 

Proof of Theorem 3. Denote by $\pi^t$ the probability that the contents of some fixed box did not change at the moments $1, 2, \ldots, t$. Then it is easy to see that for any box

$$\pi^t = (1 - 1/w)^t. \qquad (19)$$

Let us fix some letter $a_i \in A$ and define the random value $\theta_k^t$ which is equal to 1 if at the moment $t$ the $k$-th box contains $a_i$ and to 0 otherwise, $k = 1, \ldots, w$. Then it is easy to see that $E(\theta_k^t) = (1 - \pi^t) P(a_i) + \pi^t E(\theta_k^0)$ and

$$\nu_i^t = \sum_{k=1}^{w} \theta_k^t.$$

From the latter equalities we obtain

$$E(\nu_i^t) = w (1 - \pi^t) P(a_i) + w \pi^t E(\theta^0), \qquad (20)$$

where $E(\theta^0)$ denotes the average of $E(\theta_k^0)$ over the $w$ boxes. From the obvious inequality $0 \le E(\theta^0) \le 1$ and from (20) we have

$$w (1 - \pi^t) P(a_i) \le E(\nu_i^t) \le w (1 - \pi^t) P(a_i) + w \pi^t.$$

From this we obtain

$$-w \pi^t \le E(\nu_i^t) - P(a_i) w \le w \pi^t.$$

Together with (17), (19) and the known inequality $1 - x \le e^{-x}$, this yields the conclusion of Theorem 3.